Workshop Day 2B | 2022-07-26 Jeffrey M. Girard | Pitt Methods
Wrangle III
Summarize
Although we store data about many observations…
…we often want to summarize across observations
This is like folding the tibble down to one row
We’ve seen functions that summarize vectors
length(), sum(), min(), max()
mean(), median(), sd(), var()
summarize() lets us use them on tibbles
It works very similarly to mutate()
It always creates a tibble as output
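To make the comparison with mutate() concrete, here is a minimal sketch. The `toy` tibble and its column name are made up for illustration and are not workshop data. mutate() keeps every row, whereas summarize() folds the tibble down to one row.

```r
library(tidyverse)

# Hypothetical toy data (an assumption, not from the workshop)
toy <- tibble(items = c(25, 20, 16, 10, 5))

# mutate() adds a column but keeps all five rows
toy_mutated <- toy |> mutate(avg_items = mean(items))
nrow(toy_mutated)

# summarize() folds the tibble down to a single row
toy_summary <- toy |> summarize(avg_items = mean(items))
nrow(toy_summary)

# The output is still a tibble, even though it holds only one value
is_tibble(toy_summary)
```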
Summarize Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
sales <- tibble(
  customer = c(1, 2, 3, 1, 3),
  store = c("A", "A", "A", "B", "B"),
  items = c(25, 20, 16, 10, 5),
  spent = c(685, 590, 392, 185, 123)
) |>
  print()

# ==============================================================================
# USECASE: Summarize the typical sales
my_summary <- sales |>
  summarize(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |>
  print()

# ==============================================================================
# PITFALL: Don't use summary() instead of summarize()
my_summary <- sales |>
  summary(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |>
  print()

# ==============================================================================
# USECASE: Use more than one summary function
my_summary <- sales |>
  summarize(
    total_items = sum(items),
    total_spent = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |>
  print()

# ==============================================================================
# USECASE: Use counting functions
my_counts <- sales |>
  summarize(
    n_sales = n(),
    n_customers = n_distinct(customer),
    n_stores = n_distinct(store)
  ) |>
  print()
Group Summarize
We can also summarize a tibble by group
This is like folding the tibble multiple times
Specifically, we fold down to one row per group
The syntax for summarize is identical
The only difference is in the tibble
We first pass it through group_by()
Pipelines make this very easy
Group Summarize Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
sales <- tibble(
  customer = c(1, 2, 3, 1, 3),
  store = c("A", "A", "A", "B", "B"),
  items = c(25, 20, 16, 10, 5),
  spent = c(685, 590, 392, 185, 123)
) |>
  print()

# ==============================================================================
# LESSON: We pass a tibble through group_by to group it
sales
sales |> group_by(store) # note the display says "grouped"

# ==============================================================================
# USECASE: We can then summarize and get stats per group
sales |>
  group_by(store) |>
  summarize(
    customers = n_distinct(customer),
    items_sold = sum(items),
    total_sales = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  )

# ==============================================================================
# SETUP: Let's get a larger, more realistic dataset
# Extra pane > Packages tab > Install > nycflights13
library("nycflights13")
flights

# ==============================================================================
# USECASE: Find the carrier with the lowest average delays
flights |>
  group_by(carrier) |>
  summarize(m_delay = mean(dep_delay, na.rm = TRUE)) |>
  arrange(m_delay)

# ==============================================================================
# LESSON: We can also group by multiple variables
# USECASE: Let's find the day of the year with the most flights
flights |>
  group_by(month, day) |>
  summarize(n_flights = n()) |>
  arrange(desc(n_flights))

# ==============================================================================
# PITFALL: Note how this differs from grouping by just day
flights |>
  group_by(day) |>
  summarize(n_flights = n()) |>
  arrange(desc(n_flights))
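When grouping by more than one variable, it may help to know that summarize() by default drops only the last grouping variable, so the result is still grouped by the remaining ones. The sketch below uses a made-up `visits` tibble (an assumption for illustration, not workshop data) and the `.groups` argument of summarize() to return an ungrouped result instead.

```r
library(tidyverse)

# Hypothetical toy data (an assumption, not from the workshop)
visits <- tibble(
  month = c(1, 1, 1, 2, 2),
  day   = c(1, 1, 2, 1, 1),
  n     = c(3, 4, 5, 6, 7)
)

# Default: only the last grouping variable (day) is dropped
by_day <- visits |>
  group_by(month, day) |>
  summarize(total = sum(n))
group_vars(by_day) # still grouped by month

# .groups = "drop" returns an ungrouped tibble
by_day_flat <- visits |>
  group_by(month, day) |>
  summarize(total = sum(n), .groups = "drop")
group_vars(by_day_flat) # no grouping variables remain
```

Dropping the groups explicitly can avoid surprises if later steps in the pipeline (e.g., mutate()) should operate on the whole tibble rather than within each month.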
Pivot Longer and Wider
Both long and wide formats can be tidy
Long formats are better for multilevel modeling (MLM)
Wide formats are better for structural equation modeling (SEM)
It can be useful to quickly reshape a tibble
pivot_longer(): wide → long
pivot_wider(): long → wide
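As a rough illustration that the two functions undo each other, the sketch below reshapes a small made-up tibble to long format and back again. The `wide` data and its column names are assumptions for illustration, not the workshop's gradebook file.

```r
library(tidyverse)

# Hypothetical wide data: one row per student, one column per test
wide <- tibble(
  student = c("a", "b"),
  test1 = c(90, 80),
  test2 = c(85, 75)
)

# wide -> long: one row per student-test combination
long <- wide |>
  pivot_longer(cols = c(test1, test2), names_to = "test", values_to = "grade")
long

# long -> wide: back to one row per student
wide_again <- long |>
  pivot_wider(names_from = "test", values_from = "grade")
wide_again
```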
Pivot Longer Live Coding
# SETUP: We will need tidyverse and an example dataset (from workshop website)
library(tidyverse)
gradebook <- read_csv("gradebook.csv") |> print()

# ==============================================================================
# USECASE: We can pivot to long format by creating name and value variables
gradebook2 <- gradebook |>
  pivot_longer(
    cols = c(test1, test2, test3, test4, test5),
    names_to = "test",
    values_to = "grade"
  ) |>
  print()

# ==============================================================================
# TIP: Use selection helpers to select columns quickly
gradebook2 <- gradebook |>
  pivot_longer(
    cols = test1:test5,
    names_to = "test",
    values_to = "grade"
  ) |>
  print()

# ==============================================================================
# LESSON: Automatically remove the name prefix
gradebook2 <- gradebook |>
  pivot_longer(
    cols = starts_with("test"),
    names_to = "test",
    values_to = "grade",
    names_prefix = "test"
  ) |>
  print()
Pivot Wider Live Coding
# SETUP: We will need tidyverse and an example dataset (from workshop website)
library(tidyverse)
diary <- read_csv("diary.csv") |> print()

# ==============================================================================
# USECASE: Reshape this long format to a wider format
diary_scale <- diary |>
  pivot_wider(
    names_from = "scale",
    values_from = "score"
  ) |>
  print()

diary_day <- diary |>
  pivot_wider(
    names_from = "day",
    values_from = "score"
  ) |>
  print()
# NOTE: There are thus multiple possible wide formats (for different uses)

# ==============================================================================
# LESSON: We can add a prefix to each name to avoid numeric names
diary_day
diary_day <- diary |>
  pivot_wider(
    names_from = "day",
    values_from = "score",
    names_prefix = "day_"
  ) |>
  print()

# ==============================================================================
# LESSON: We can also pivot on multiple columns at once
diary_double <- diary |>
  pivot_wider(
    names_from = c("scale", "day"),
    values_from = "score"
  ) |>
  print()
Visualize I
What is a graphic?
A data visualization expresses data through visual aesthetics.
Describing Graphics
Some simple graphics are easy to describe and may even have ready names.
Describing Graphics
A grammar of graphics will help us describe more complex graphics.
The Grammar of Graphics
The grammar of graphics is a set of rules for describing and creating data visualizations
To make our data visual (and therefore put our highly evolved occipital lobes to work)…
We connect variables to visual qualities
We represent observations as visual objects
This requires four fundamental elements
We will first learn about them in lecture
We will then apply them in R using {ggplot2}
Data
Graphics require data (e.g., tibbles), which describe observations using variables.
Aesthetic Mappings
Graphics require aesthetic mappings, which connect data variables to visual qualities.
Scales
Graphics require scales, which connect specific data values to specific aesthetic values.
Geometric Objects
Graphics require geometric objects (geoms), which represent the observations.
ggplot2 Basics
The ggplot2 package is a part of tidyverse
No need to install or load it separately
It plays nicely with tibbles and wrangling
It implements the grammar of graphics in R
The “gg” stands for “grammar of graphics”
Thus, we will need to provide all four elements
We will create a pseudo-pipeline of commands
However, we will use + rather than |>
This is because {ggplot2} predates the R pipe
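A minimal sketch of how the two operators combine: wrangling steps are chained with |> and then handed to ggplot(), after which layers are chained with +. The filter on compact cars is only an illustrative choice, not part of the workshop script.

```r
library(tidyverse)

# Wrangling steps use |>, plot layers use +
p <- mpg |>
  filter(class == "compact") |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()
p

# PITFALL: Using |> between plot layers is an error, so this is left commented out
# mpg |> ggplot(aes(x = displ, y = hwy)) |> geom_point()
```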
ggplot2 Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg

# ==============================================================================
# LESSON: First, set the data to a tibble
p <- ggplot(data = mpg)
p

# ==============================================================================
# LESSON: Next, set the aesthetic mappings with aes()
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
p

# ==============================================================================
# TIP: You can leave off the optional argument names
p <- ggplot(mpg, aes(x = displ, y = hwy))
p

# ==============================================================================
# LESSON: Next, set the positional scales
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  scale_x_continuous(
    name = "Engine Size (in liters)",
    limits = c(1, 7),
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  )
p

# ==============================================================================
# LESSON: Finally, add a point geom
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  scale_x_continuous(
    name = "Engine Size (in liters)",
    limits = c(1, 7),
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  ) +
  geom_point()
p

# ==============================================================================
# TIP: If you leave off the scales, R will try to guess
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p

# ==============================================================================
# LESSON: We can also customize the geom with arguments
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red", shape = "square", size = 2)
p
Basic Layering
ggplot2 uses a layered grammar of graphics
We can keep stacking geoms on top
Layering adds a lot of possibilities
We can convey more complex ideas
We can learn more about our data
But we can still describe these graphics
Just describe each layer in turn
And describe the layers’ ordering
Basic Layering Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg

# ==============================================================================
# USECASE: Add a smooth geom (i.e., line of best fit)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

# ==============================================================================
# USECASE: Add a line geom (i.e., connecting points)
economics

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_point()

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_point() +
  geom_line(color = "orange", size = 1)

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "orange", size = 1) +
  geom_point()

# ==============================================================================
# USECASE: Add reference line geoms
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_hline(yintercept = 0, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_vline(xintercept = 7.5, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_abline(intercept = 4000, slope = 0.5, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()
Distribution Geoms
Variable distributions are critical in data analysis
What are the most and least common values?
What are the extrema (min and max values)?
Are there any outliers or impossible values?
How much spread is there in the variable?
What shape does the distribution take?
Visualization is a quick way to assess this
Visualizations can also communicate it to others
Distribution Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg

# ==============================================================================
# USECASE: Creating histograms
ggplot(mpg, aes(x = hwy)) + geom_histogram()
ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)
ggplot(mpg, aes(x = hwy)) + geom_histogram(binwidth = 2)
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, color = "red", size = 1)
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, color = "red", size = 1, fill = "white")

# ==============================================================================
# USECASE: Creating density plots
ggplot(mpg, aes(x = hwy)) + geom_density()
ggplot(mpg, aes(x = hwy)) +
  geom_density(color = "red", size = 1, fill = "white")

# ==============================================================================
# USECASE: Creating box plots
ggplot(mpg, aes(x = hwy)) + geom_boxplot()
ggplot(mpg, aes(x = hwy, y = class)) + geom_boxplot(varwidth = TRUE)

# ==============================================================================
# USECASE: Creating bar plots to count categorical variables
ggplot(mpg, aes(x = class)) + geom_bar()

# ==============================================================================
# PITFALL: Don't try to create histograms for categorical variables
ggplot(mpg, aes(x = class)) + geom_histogram() # error
Working with Color
Color scales come in two main types:
Discrete scales have separate colors
Best with factor variables
Continuous scales form a gradient
Best with numeric variables
There are two ways to control color:
You can map color to a variable
It will take on different values
You can set color to a value
It will take on one value only
Color Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg

# ==============================================================================
# USECASE: Continuous color scales work well with numeric variables
ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4)

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4) +
  scale_color_continuous(type = "viridis")

# ==============================================================================
# USECASE: Use a discrete color scale with categorical variables
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_discrete(
    name = "Drivetrain",
    breaks = c("4", "f", "r"),
    labels = c("Four Wheel", "Front Wheel", "Rear Wheel")
  )

# ==============================================================================
# PITFALL: Don't forget to set categorical variables as factors
ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point() # R guesses you want a continuous scale

ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  scale_color_discrete(name = "Cylinders")

# ==============================================================================
# LESSON: Set a geom's color aesthetic to make it always that color
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red")

# ==============================================================================
# PITFALL: However, do this inside of geom() not aes()
ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point() # unintended

# ==============================================================================
# LESSON: If you both set and map color, the setting will win
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(color = "blue")
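One way to see why putting a color name inside aes() is unintended is to inspect the colors that ggplot2 actually computes for each point. The sketch below uses layer_data() from ggplot2 for this; the object names `p_set` and `p_map` are made up for illustration.

```r
library(tidyverse)

# Setting: the literal color "blue" is applied to every point
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
unique(layer_data(p_set)$colour)

# Mapping a constant: "blue" is treated as a data value and passed through
# the discrete color scale, so the points get the scale's color instead
p_map <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point()
unique(layer_data(p_map)$colour)
```

In the mapped version, "blue" is just a category label (it also shows up in a legend), which is why the points do not come out blue.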
Other Aesthetics
For blocky elements like bars…
color controls the outline color
fill controls the internal color
size controls the line thickness (linewidth in ggplot2 3.4.0 and later)
Some mappings will induce grouping
You’ll get separate geoms per group
It can be helpful to use redundant mapping
Map one variable to multiple aesthetics
Then if one “fails” the other may work
Other Aesthetics Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg

# ==============================================================================
# USECASE: Mapping the shape aesthetic to a categorical variable
ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point(size = 3)

# ==============================================================================
# PITFALL: Don't try to map shape to a continuous variable
ggplot(mpg, aes(x = displ, y = hwy, shape = hwy)) +
  geom_point() # error
# NOTE: This doesn't work because there are way more numbers than shapes

# ==============================================================================
# LESSON: Color vs. Fill and Size for Blocks
ggplot(mpg, aes(y = class)) +
  geom_bar()

ggplot(mpg, aes(y = class)) +
  geom_bar(color = "darkred", fill = "lightblue", size = 1)

# ==============================================================================
# LESSON: Some aesthetics cause grouping when mapped to a categorical variable
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm") # single smooth

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm") # three smooths

# ==============================================================================
# USECASE: Mapping to the fill aesthetic and setting the alpha property
ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density()

ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.3)

# ==============================================================================
# TIP: If you map the same variable to multiple aesthetics, you get redundancy
ggplot(mpg, aes(x = displ, y = hwy, shape = drv, color = drv)) +
  geom_point(size = 3) # if color fails, shape still works
Themes Live Coding
# SETUP: We will need tidyverse and an example graphic
library(tidyverse)
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Fuel Efficiency")
p

# ==============================================================================
# USECASE: Apply a "complete" theme
p + theme_bw()
p + theme_classic()

# ==============================================================================
# TIP: You can quickly change the font size of all elements with base_size
p + theme_grey(base_size = 24)

# ==============================================================================
# LESSON: The ggthemes package adds some fun complete themes
library(ggthemes)
p + theme_wsj()
p + theme_economist()
p + theme_stata()

# ==============================================================================
# LESSON: For more precise control, we can use theme()
p + theme(legend.position = "top")
p + theme(plot.title = element_text(color = "purple", face = "bold"))
p + theme(panel.grid = element_blank())
# NOTE: There are a lot of elements to learn, so use a cheatsheet!
Exporting Graphics
We may need to export graphics from R
e.g., for a paper, poster, or presentation
This job is handled fantastically by ggsave()
We can create many types of files
We can customize the exact size
I recommend .png for most daily purposes
For publishing, I prefer .pdf or .svg
They retain perfect quality at any zoom
You can send these files to most publishers
Exporting Live Coding
# SETUP: We will need tidyverse and an example graphic
library(tidyverse)
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Engine Displacement", y = "Highway MPG")
p

# ==============================================================================
# USECASE: Save a specific ggplot object to a file
ggsave(filename = "pfinal.png", plot = p)

# ==============================================================================
# LESSON: Specify the size of the file to create
ggsave(filename = "pfinal2.png", plot = p, width = 6, height = 3, units = "in")

# ==============================================================================
# LESSON: Just change the extension to create a different file type
ggsave(filename = "pfinal2.pdf", plot = p, width = 6, height = 3, units = "in")

# ==============================================================================
# PITFALL: Creating a very large file may lead to small text
ggsave(filename = "p_poster.png", plot = p, width = 12, height = 8, units = "in")

# ==============================================================================
# TIP: You can quickly increase the text size using base_size
p2 <- p + theme_grey(base_size = 24)
ggsave(
  filename = "p_poster2.png",
  plot = p2,
  width = 12,
  height = 8,
  units = "in"
)